Live Demo

Before we do any sort of research into a data set, we need to know what the goal is.

In this case I'm demoing how you can use MinHash calculations and text-based similarities to discover potential phishing domains. The code is mostly borrowed from the phishy domain analyzer, and the goal is to demonstrate how one can start with some data analysis/research and end up with an analyzer.

The question we are trying to answer is: are there domains in the data set that are "similar" to the most frequently visited domains, and thus have a chance of being phishing domains? The theory is that the domains the user visits most often are most likely "legit"; if there are anomalies in the data set, that is, domains that are visited less frequently yet are very similar to the frequently visited ones, they could be malicious or phishing domains.

Imports

Simple imports of libraries that are going to be used throughout.


In [ ]:
import os
import re
import datetime
import getpass

import numpy as np
import pandas as pd
import altair as alt

import difflib 

from timesketch_api_client import client as timesketch_client

Caveat

If you are running this from JupyterLab you will only be able to run the Timesketch portion, not the first half.

You will also not be able to import everything: the first import statement will work fine, but the second one is only for those running this locally.

Finally, if you are running from JupyterLab you'll need to run the pip command below (it is commented out for local runs).


In [ ]:
#!pip install vega

Second import statement (only run this if you are not using JupyterLab).


In [ ]:
from datasketch.minhash import MinHash
from six.moves import urllib_parse as urlparse

from timesketch.lib import similarity
from timesketch.lib.analyzers import utils

_ = alt.renderers.enable('notebook')

Read the Data

Normally this would be done by storing the data in Timesketch and then querying it.

In this case we really only care about entries that have a URL or a domain attribute set. The TS query would be:

        query = (
            '{"query": { "bool": { "should": [ '
            '{ "exists" : { "field" : "url" }}, '
            '{ "exists" : { "field" : "domain" }} ] } } }')

However, this notebook was used for a live demo at a conference, and we don't want to depend on internet connectivity, so we read a JSON file from disk instead.

To get a copy of the JSON file, please reach out to kiddi at log2timeline dot net.


In [ ]:
user_running_notebook = getpass.getuser()

DATA_LOCATION = '/Users/{0:s}/Documents/Presentation/data/studentpc1.jsonl'.format(user_running_notebook)

The file itself is really large:


In [ ]:
!du -hs $DATA_LOCATION

Since all we care about are entries with URLs or domains in them, we'll open the file and read it line by line, picking just those entries and loading them into a data frame.


In [ ]:
import json

data_entries = []

with open(DATA_LOCATION, 'r') as fh:
    for line in fh:
        parsed = json.loads(line)
        if 'url' in parsed or 'domain' in parsed:
            data_entries.append(parsed)

network_data = pd.DataFrame(data_entries)
del data_entries

Initial Exploration

Now that we've got a data frame with just our browsing data, let's inspect it before we make any other changes.


In [ ]:
network_data.shape

When doing research or data exploration it really helps to know what the data looks like, so we take a small sample to inspect.


In [ ]:
network_data.head(3)

Let's check whether we've got a domain column....


In [ ]:
if 'domain' in network_data.columns:
    print('we do have a domain column')
else:
    print('no such column exists....')

Manipulate The Data

Now we need to start manipulating or extracting the features we need.

We could also have just used regular expressions, e.g. r'https?://([^$/]+)' and the built-in network_data.url.str.extract(<regular_expression>), to extract the data. However, here I'm using a library to extract the domain part of the URL, since it handles more types of URLs.
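
For reference, the regular-expression variant would look roughly like this (a sketch; domain_re is just an illustrative column name, and the pattern only matches http/https URLs and keeps any port number):

    network_data['domain_re'] = network_data.url.str.extract(
        r'https?://([^$/]+)', expand=False)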


In [ ]:
def get_domain(url):
    """Extract the domain (netloc without the port) from a URL."""
    if not isinstance(url, str):
        # Some records only have a domain attribute, so the url can be missing (NaN).
        return ''
    domain_parsed = urlparse.urlparse(url)
    domain_full = domain_parsed.netloc
    domain, _, _ = domain_full.partition(':')
    return domain

def get_tld(domain):
    """Return the last two labels of the domain, eg. www.mbl.is -> mbl.is."""
    return '.'.join(domain.split('.')[-2:])

network_data['domain'] = network_data.url.apply(get_domain)
network_data['tld'] = network_data.domain.apply(get_tld)
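
To make the helpers a bit more concrete, this is roughly what they produce (mbl.is is used purely as an example here):

    get_domain('https://www.mbl.is:443/frettir/')  # -> 'www.mbl.is'
    get_tld('www.mbl.is')                          # -> 'mbl.is'

Note that get_tld really returns the registered domain (the last two labels) rather than just the top level domain.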

The Meat

Now that we've got the domain we can start looking at whether there are similar domains. First, though, we need to figure out which domains are the most common, to see which ones are most likely the "corporate" domains.


In [ ]:
network_data.tld.value_counts()[:10].keys()

Let's create a list of domains to "watch"


In [ ]:
# We "bootstrap" the watch list with the known "corp" domains that we've got and we know about.
watch_list = ['greendale.xyz']

# Let's add the top domains to the list.
watch_list.extend(network_data.tld.value_counts()[:10].keys())
watch_list.extend(network_data.domain.value_counts()[:10].keys())

# Remove empty records...
watch_list = [x for x in watch_list if x]
watch_list

Let's now calculate the MinHash value for all of the domains on the watch list.

Start by defining a few functions.


In [ ]:
# This is just some code from the phishy analyzer, not really something we go through here.
# Added here to have the code to play with.
domain_scoring_threshold = 0.75

def _get_minhash_from_domain(domain):
    """Get the MinHash value from a domain name.

    This function takes a domain, removes the TLD extension
    from it and then creates a MinHash object from every
    remaining character in the domain.

    If a domain starts with www., the www. prefix will be stripped
    from the domain before the MinHash is calculated.

    Args:
      domain: string with a full domain, eg. www.google.com

    Returns:
        A minhash (instance of datasketch.minhash.MinHash)
    """
    domain_items = domain.split('.')
    domain_part = '.'.join(domain_items[:-1])

    minhash = MinHash(similarity.DEFAULT_PERMUTATIONS)
    for char in domain_part:
        minhash.update(char.encode('utf8'))

    return minhash

def _get_similar_domains(domain, domain_dict):
    """Compare a domain to a list of domains and return similar domains.

    This function takes a domain and a dict object that contains
    domain names as keys and, as values, the calculated MinHash value for
    that domain as well as the domain's depth (mbl.is is 2, foobar.evil.com
    would be 3). It will then strip www. if needed from the domain, and
    compare the Jaccard similarity between all domains in the dict and the
    supplied domain (removing the TLD extension from all domains).

    If the Jaccard similarity between the supplied domain and one or more of
    the domains in the domain dict is higher than the configured threshold
    the domain is further tested to see if there are overlapping substrings
    between the two domains. If there is a common substring that is longer
    than half the domain name and the Jaccard similarity is above the
    threshold the domain is considered to be similar.

    Args:
        domain: string with a full domain, eg. www.google.com
        domain_dict: dict with domain names (keys) and MinHash objects
            (values) for all domains to compare against.

    Returns:
        a list of tuples (similar_domain_name, score) with the names of
        the similar domains as well as the Jaccard similarity between
        the supplied domain and the matching one.
    """
    domain = utils.strip_www_from_domain(domain)

    similar = []
    if '.' not in domain:
        return similar

    if domain in domain_dict:
        return similar

    if any(domain.endswith('.{0:s}'.format(x)) for x in domain_dict):
        return similar

    # We want to get rid of the TLD extension of the domain.
    # This is only used in the substring match in case the Jaccard
    # similarity is above the threshold.
    domain_items = domain.split('.')
    domain_depth = len(domain_items)
    domain_part = '.'.join(domain_items[:-1])

    minhashes = {}
    for index in range(0, domain_depth - 1):
        minhashes[domain_depth - index] = _get_minhash_from_domain(
            '.'.join(domain_items[index:]))

    for watched_domain, watched_item in iter(domain_dict.items()):
        watched_hash = watched_item.get('hash')
        watched_depth = watched_item.get('depth')

        minhash = minhashes.get(watched_depth)
        if not minhash:
            # The supplied domain's depth does not match this watched
            # domain.
            continue
        score = watched_hash.jaccard(minhash)
        if score < domain_scoring_threshold:
            continue

        watched_domain_items = watched_domain.split('.')
        watched_domain_part = '.'.join(watched_domain_items[:-1])

        # Check if there are also any overlapping strings.
        sequence = difflib.SequenceMatcher(
            None, domain_part, watched_domain_part)
        match = sequence.find_longest_match(
            0, len(domain_part), 0, len(watched_domain_part))

        # We want to have at least half of the domain matching.
        # TODO: This can be improved, this is a value and part that
        # needs or can be tweaked. Perhaps move this to a config option
        # that is the min length of strings.
        match_size = min(
            int(len(domain_part)/2), int(len(watched_domain_part)/2))
        if match.size < match_size:
            continue
        similar.append((watched_domain, score))

    return similar

And now to create the watched domain dict, an object that simply contains all the watched domains as keys; the value is the MinHash value and the depth. The depth is defined simply as the number of parts within the domain name, eg. www.mbl.is is 3 levels deep, whereas cnn.com is 2, etc.


In [ ]:
watched_domains = {}
for domain in watch_list:
    minhash = _get_minhash_from_domain(domain)
    watched_domains[domain] = {
        'hash': minhash,
        'depth': len(domain.split('.'))
    }
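
As a quick sanity check, a made-up look-alike domain such as greendaale.xyz (purely hypothetical) should come back as similar to the watched greendale.xyz with a score close to 1.0, since it contains the same characters and shares a long substring with it:

    # Hypothetical domain, just to verify the functions behave as expected.
    _get_similar_domains('greendaale.xyz', watched_domains)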

Now that we've got the common domains, let's calculate the Jaccard similarity (estimated with MinHash) and the string overlap for all of the domains in the history.

Since what we do is take all the letters in the domain name, add them to a set and create the MinHash value from that set, the domains abc.com and bacccab.moc will be considered the same (they contain the same letters). That's why we add the overlapping-string check, to make sure the domains really are similar.
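
A tiny illustration of why that extra check is needed, using the helper defined above: two domains made up of the same set of characters produce practically identical MinHash values even though the names look nothing alike.

    hash_a = _get_minhash_from_domain('abc.com')
    hash_b = _get_minhash_from_domain('bacccab.moc')
    hash_a.jaccard(hash_b)  # roughly 1.0, despite the domains being unrelated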

Now we can go through the domains we've got and find "similar" domains.


In [ ]:
entries = []
for domain in network_data.domain.unique():
    similar_domains = _get_similar_domains(domain, watched_domains)
    if not similar_domains:
        continue

    print('Domain: {0:s} does have similar domains discovered:'.format(domain))
    for similarities in similar_domains:
        s_domain, s_score = similarities
        print('    [{0:s}] - {1:0.2f}%'.format(s_domain, s_score * 100))
        entry = {'domain': domain, 'watched_domain': s_domain, 'score': s_score}
        entries.append(entry)
    print('---')
similar_domains = pd.DataFrame(entries)

This can also be looked at visually.

As a table:


In [ ]:
similar_domains.sort_values('score', ascending=False)

So we see that someone is visiting a URL that looks very similar to our corp URL...

This could trigger a detection.

Or as a chart (might not be the best representation)...


In [ ]:
similar_domains['percentage'] = similar_domains.score * 100
alt.Chart(similar_domains).mark_point().encode(
    y='domain', x='watched_domain', size='percentage',
    color='watched_domain', fill='watched_domain',
    tooltip=['watched_domain','domain', 'percentage'],
).properties(
    width=600, height=400,
    title='Similar Domains'
)
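
As a small follow-up, these hits could be used to flag the original events, for instance by pulling out every browsing record whose domain showed up in the comparison (just a sketch):

    flagged = network_data[network_data.domain.isin(similar_domains.domain)]
    flagged[['url', 'domain']].head()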

Timesketch Demo

Small TS demo. Let's start by connecting to our demo server.


In [ ]:
client = timesketch_client.TimesketchApi('https://demo.timesketch.org', 'demo', 'demo')

Then we can take a look at what sketches are available and pick one of them.


In [ ]:
for index, sketch in enumerate(client.list_sketches()):
    print('[{0:d}] ID: {1:d} - {2:s} <{3:s}>'.format(index, sketch.id, sketch.name, sketch.description))

In this case we want to get the data from the Greendale incident.


In [ ]:
sketch = client.get_sketch(238)

Let's start by checking whether or not there are any saved views.


In [ ]:
lines = []
for view in sketch.list_views():
    view_dict = {
        'id': view.id,
        'name': view.name}
    lines.append(view_dict)

pd.DataFrame(lines)

Let's look at what browser searches were made.


In [ ]:
view = sketch.get_view(view_id=2010)

results = sketch.explore(
    view=view, as_pandas=True,
    return_fields='datetime,message,timestamp_desc,source,source_short,label,tag,tags,url,domain,search_string')

Let's look at some of the browser searches.


In [ ]:
results.search_string.value_counts()

We can take a look at the frequency of search queries.


In [ ]:
results['date'] = pd.to_datetime(results.datetime)
results['day'] = results.date.dt.strftime('%Y%m%d')
r_group = results[['day', 'datetime']].groupby(by='day', as_index=False)
r_count = r_group.count()
r_count['count'] = r_count['datetime']
del r_count['datetime']

alt.Chart(r_count, width=500, height=300).mark_line(point=True).encode(
    x='day',
    y='count',
    tooltip=['day', 'count'],
).properties(
    title='Search Queries per day'
)

The problem with this is that there are days with no search queries, which makes the chart look a bit odd. Let's fill in the missing values so that each day is represented in the chart.


In [ ]:
# Let's skip the first day, since that was in the year 2000....
r = r_count[1:]

# We need to find out the first and last day.
r_first = r.day.values[0]
r_end = r.day.values[-1]

# Let's create a timestamp from the first day.
year = r_first[0:4]
month = r_first[4:6]
day = r_first[6:]
first_day = datetime.datetime(year=int(year), month=int(month), day=int(day))

# Now we can create a list of all dates between first and last date.
all_days = []
cur_day = first_day
while True:
    day_string = cur_day.strftime('%Y%m%d')
    if day_string == r_end:
        break
    all_days.append(day_string)
    cur_day = cur_day + datetime.timedelta(days=1)

# Let's create sets, one for all days, the second for the current days.
cur_days = set([str(x) for x in r.day.values])
all_days = set(all_days)
# Now we can easily find a list of all missing dates.
missing_days = all_days.difference(cur_days)

# Let's create a data frame that contains just the missing dates, with a count of zero.
lines = []
for day in missing_days:
    line = {'day': str(day), 'count': 0}
    lines.append(line)
    
df = pd.DataFrame(lines)
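
As an aside, pandas can also build the full range of days directly; a more compact way to get zero counts for the missing days might look like this (a sketch, with r_alt as an illustrative name):

    idx = pd.to_datetime(r.day, format='%Y%m%d')
    r_alt = (r.set_index(idx)['count']
              .reindex(pd.date_range(idx.min(), idx.max(), freq='D'), fill_value=0)
              .rename_axis('date')
              .reset_index())
    r_alt['day'] = r_alt.date.dt.strftime('%Y%m%d')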

Now we can repeat what we did before, just this time with filled values.


In [ ]:
# Note: DataFrame.append has been removed in newer versions of pandas, so we
# use pd.concat to combine the real counts with the zero-filled missing days.
r_filled = pd.concat([r, df], sort=True)

alt.Chart(r_filled, width=900, height=300).mark_line(point=True).encode(
    x=alt.X('day', axis=alt.Axis(title='Day of Query')),
    y='count',
    tooltip=['day', 'count'],
).properties(
    title='Search Queries per day'
)

We can also just look at saved aggregations...


In [ ]:
aggregations = []
for index, agg in enumerate(sketch.list_aggregations()):
    print('[{0:d}] - {1:s}, {2:s} <{3:s}> -> {4:s}'.format(index, agg.name, agg.aggregator_name, agg.type, agg.chart_type))
    aggregations.append(agg)

They are more or less the same... we can take a look at one of them.

As a table:


In [ ]:
aggregations[3].table

Or as a chart:


In [ ]:
aggregations[3].chart

Now we shall do something different. Let's look at all executions that are recorded by our prefetch parser. Once we've got these records we can search for whether they appear somewhere in our web history, that is, whether there is a link between prefetch and browser history (this will miss ZIP files or other compressed files that get downloaded).


In [ ]:
data = sketch.explore(
    query_string='parser:"prefetch"', as_pandas=True,
    return_fields='datetime,timestamp_desc,source,source_short,message,executable')

data.shape

Now that we've got all the prefetch data, let's look at all executables and do a query for browser traffic that contains those names.


In [ ]:
browser_traffic = pd.DataFrame()

for executable in data.executable.unique():
    if not executable:
        continue
    if executable is np.nan:
        continue
    if not executable.lower().endswith('.exe'):
        continue
    exec_data = sketch.explore(
        query_string='url:"*{0:s}*"'.format(executable.lower()),
        return_fields='datetime,timestamp_desc,source,source_short,message,domain,url',
        as_pandas=True)
    browser_traffic = pd.concat([browser_traffic, exec_data])

Let's combine the two data frames, that is the browser traffic and the prefetch data. We'll then sort the result by date.


In [ ]:
combined = pd.concat([browser_traffic, data], sort=False)
c_sorted = combined.sort_values(by='datetime')
c_sorted['date'] = pd.to_datetime(c_sorted.datetime)

First of all, let's just look at VPN.exe:


In [ ]:
c_sorted[c_sorted.message.str.contains('vpn.exe', case=False)][['date', 'timestamp_desc', 'message', 'url', 'executable', 'domain']][2:]

And to examine all the hits:


In [ ]:
c_sorted[['date', 'timestamp_desc', 'message', 'url', 'executable']]

We can also extract the executable from the web history.


In [ ]:
import six

def check_row(row):
    executable = row.executable
    if isinstance(executable, six.text_type):
        if not executable:
            return np.nan
        return executable.lower()
    url = row.url
    if not isinstance(url, six.text_type):
        return np.nan
    _, _, last_part = url.rpartition('/')
    if '&' in last_part:
        last_part, _, _ = last_part.partition('&')
    if '#' in last_part:
        last_part, _, _ = last_part.partition('#')
        
    if not last_part:
        return np.nan
    return last_part

c_sorted['new_exec'] = c_sorted.apply(check_row, axis=1)

Now we can look at the most common applications.


In [ ]:
c_sorted['day'] = c_sorted.date.dt.strftime('%Y%m%d')
c_sorted_data = c_sorted[~c_sorted.new_exec.isna()]

c_sorted_data.new_exec.value_counts()[:10]

And then look at vpn.exe.


In [ ]:
c_sorted_data[c_sorted_data.new_exec == 'vpn.exe']

Or we can do this as a chart...


In [ ]:
c_group = c_sorted[['day', 'date', 'new_exec']].groupby(by=['new_exec','day'], as_index=False)
c_count = c_group.count()
c_count['count'] = c_count.date
c_count['executable'] = c_count['new_exec']
del c_count['date']

alt.Chart(c_count).mark_point(filled=True).encode(
    x='day',
    y='executable',
    size='count',
)

OK... that covers all of them, but what about those executables that appear BOTH in a URL and in prefetch?

To be able to do that, we'll first add the executable field to the two data frames from before.


In [ ]:
browser_traffic.head(10)

def get_exec(url):
    """Extract a lowercased executable name from a URL, if it contains one."""
    items = [x.lower() for x in url.split('/')]
    executable = ''
    for item in items:
        if '.exe' not in item:
            continue
        executable = item
    if '#' in executable:
        executable, _, _ = executable.partition('#')
    if '&' in executable:
        executable, _, _ = executable.partition('&')
    if '\\' in executable:
        _, _, executable = executable.rpartition('\\')
    return executable

browser_traffic['new_exec'] = browser_traffic.url.apply(get_exec)

# Redefine get_exec for the prefetch data (the URL version has already been
# applied above); here we only need to lowercase the executable name.
def get_exec(exec_string):
    if not isinstance(exec_string, six.text_type):
        return exec_string
    return exec_string.lower()

data['new_exec'] = data.executable.apply(get_exec)

Now we can merge the two data frames.


In [ ]:
merged_df = data.merge(browser_traffic, on='new_exec')

And then we can do the same as we did above.


In [ ]:
merged_df['date'] = pd.to_datetime(merged_df['datetime_x'])
merged_df['day'] = merged_df.date.dt.strftime('%Y%m%d')
m_group = merged_df[['day', 'date', 'new_exec']].groupby(by=['new_exec','day'], as_index=False)
m_count = m_group.count()
m_count['count'] = m_count.date
m_count['executable'] = m_count['new_exec']
del m_count['date']

alt.Chart(m_count, width=300, height=200).mark_point(filled=True).encode(
    x='day',
    y='executable',
    size='count',
    color='executable'
).properties(
    title='Executables that appear both in Prefetch and Internet history'
)

Search by label.

Let's look at another part of the API: starred events.


In [ ]:
sketch.search_by_label('__ts_star', as_pandas=True)